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Covariance matrices of amino acid displacements, commonly used to characterize the large-scale 
movements of proteins, are investigated through the prism of Random Matrix Theory. Bulk univer- 
sality is detected in the local spacing statistics of noise-dressed eigenmodes, which is well described 
by a Brody distribution with parameter f3 c± 0.8. This finding, supported by other consistent indi- 
cators, implies a novel quantitative criterion to single out the collective degrees of freedom of the 
protein from the majority of high-energy, localized vibrations. 
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Introduction. Proteins are biomolecules of essential im- 
portance for biological aspects ranging from structural 
(e.g. viral capsids, cytoskeleton) to biochemical ones 
(e.g. enzymes). The biological functionality of a protein 
often relies on its capability to undergo large-scale con- 
formational changes [TJ [2] . In order to characterize such 
motions, a convenient starting point is the Displacement 
Covariance Matrix (DCM) 6^- = (SriSrj), where Sr is 
the deviation of a protein's backbone (C a ) atom from 
a given reference structure (in this notation i,j indicate 
both amino acid and Cartesian indices) and (• • • ) indi- 
cates the ensemble average. 

The eigenmodes of the DCM with largest eigenvalues, 
corresponding to low-energy excitations, are responsible 
for large-scale, collective motions of the protein. These 
modes play a major role in the molecule's functionality, 
and a handful of them account for a large fraction of the 
overall fluctuation [3 j. The other more numerous eigen- 
modes instead describe high-energy, localized vibrations 
which are more heavily affected by the fine details of 
structure and interactions. 

Atomistic force-field Molecular Dynamics (MD) is a 
widely employed tool to investigate the dynamical prop- 
erties of a protein. The duration of a MD trajectory, 
however, poses a problem in the sampling of the phase 
space [4 j, since the essential space can vary from run to 
run. Nevertheless, the principal components are more ro- 
bust than the high-energy sector of the spectrum and are 
consistent amongst independent simulations. It is there- 
fore highly desirable to single out the statistically signif- 
icant low-energy modes from the bulk of 'noise-dressed' 
ones. Far-reaching implications include the description 
of large-scale fluctuations in terms of a suitable set of 
collective coordinates [5] and the identification of a few 
relevant modes as a basis for coarse-grained models of 
structure and dynamics jg]. 

The purpose of this Letter is to address this issue quan- 
titatively. We propose a novel criterion for statistical sig- 
nificance based on the comparison with local eigenvalue 



statistics from Random Matrix Theory (RMT). Appli- 
cations of RMT to biophysical issues are so far rather 
limited [7HT3]. Our new approach relies on an ensemble 
of DCM, rather than a single instance. This permits to 
characterize in a statistical sense the low-energy subspace 
of a specific protein against a 'null' model with random 
correlations. The sharp onset of RMT-like level spacing 
statistics from the first few modes onwards, along with 
other consistent indicators, signals a clear-cut separation 
of the spectrum in two statistically incompatible regions, 
associated with discordant dynamical properties. 

Preliminaries. In order to collect a DCM ensemble 
we resort to exactly-solvable Elastic Networks Models 
(ENM) [14], rather than other available strategies such 
as the calculation of the Hessian matrix [15] . Indeed it is 
known [16] that the low-energy modes of the spectrum, 
the most interesting for the present work, are well repro- 
duced by ENM with lighter computational effort. 

Our matrix ensemble is obtained from a MD simula- 
tion carried out in [5] on E. Coli Adenylate Kinase, a 
N = 214 amino acid single-chain phosphotransferase pro- 
tein [T7] . During this 50 ns simulation, the molecule ex- 
plores many different free energy minima, or substates. 
Within each substate, the protein fluctuates around a 
well-defined average structure. The 4000 configurations 
(MD frames) of the shortest substate (2 ns) are taken as 
reference structures for the anisotropic /3-Gaussian net- 
work model (/3-GM) [18]. This model retains only two 
centroids per amino acid: the C a atom and a bead rep- 
resenting the sidechain. The free energy profile is approx- 
imated, for small deviations from the reference structure, 
with a network of anisotropic springs of equal strength 
connecting all the centroids within the cutoff distance of 
7.5 A. Due to the quadratic nature of the interactions, 
the DCM C and the eigenspace (A&, \X k )) of a given pro- 
tein structure are obtained inverting the resulting effec- 
tive free energy matrix M (eq. 1-7 of [18 ) as C = M _1 , 
where e|A fe ) = X k \X k ), (X k > X k+1 ). 

Note that the total number of Degrees of Freedom 
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(DoF) is 37V — 6 = 636, since M always presents six 
null modes (corresponding to global symmetries of the 
model): the matrix inversion is obviously meaningful 
only within the subspace orthogonal to ker(M) [18] . The 
local spectral statistics (probing correlations on a scale 
comparable with the mean level spacing) of the obtained 
ensemble of DCM is then compared with the predictions 
of random correlation matrices. Analogous results (not 
included here), have been obtained on other substates of 
the same trajectory and on a MD simulation of G pro- 
tein [19 , a 56 residues long protein, as a preliminary 
validation of the conclusions drawn in the present work 
on different case studies. 

Random Matrix Predictions. The Wishart-Laguerre 
(WL) [20j [21] ensemble of random covariances includes 
NxN matrices W of the form W = (l/T)XX t , where X 
is an N x T matrix containing N time series of T indepen- 
dent elements drawn from a Gaussian distribution with 
zero mean and fixed variance, X ~ N(0,<r 2 ). Since W is 
the covariance matrix of a maximally random data set, it 
is usually the optimal candidate as a 'null model' with the 
lowest degree of built-in information to compare empir- 
ical data with. This program has been implemented on 
financial data [22], internet routers networks [23], EEG 
data [24] and atmospheric correlations [25 among others. 
The requirement of fixed data variance <r 2 guarantees ro- 
tational invariance and thus the exact solvability of WL. 
Nevertheless, it makes the comparison inappropriate in 
the present case, where the empirical 'data matrix' X is 
not accessible and its row variances af (corresponding to 
the Qa entries) turn out to be unevenly spread. We thus 
consider a slightly improved a-WL model ( no n- invariant) 
where different variances are randomly assigned to each 
data row. Details about this improved null model, unim- 
portant for the present discussion, will be presented else- 
where; it suffices to say that the individual statistics is 
rather insensitive to the distribution of variances allo- 
cated to the rows. 

As a typical example of local level statistics, we mainly 
consider the Individual Eigenvalue Spacing (I ES) Sk de- 
fined as [26] Sk = (\k — Afc+i)/ (Xk — A^+i), where the av- 
erage (•) is taken over many samples and clearly (sk) = 1 
for any k. We numerically found that for a-WL the IES 
is well approximated by the Brody [27] one-parameter 
distribution, pp(s) — c#(l + /3)s^ exp(— cps 1+ P), with 
C(3 = [r((2 + /3)/(l + /3))] 1+/3 , and /? « 0.84 ± 0.02 
(obtained from a one-parameter fit). Analytical predic- 
tions for the level distributions in non-invariant ensem- 
bles are generally lacking. The Brody distribution is the 
simplest and most commonly employed fitting formula 
for non-invariant RMT spacings (see e.g. [28] and refer- 
ences therein), interpolating between the limiting Poisson 
(/? = 0) and Wigner (/? = 1) distributions. 

Results and Discussion. The RMT local statistics is 
used here as a null model for two sets of spectra: {A^} 
(the k-th. bare eigenvalue of j-th DCM sample, k = 1 be- 
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FIG. 1: fraction of the protein's fluctuation f n (eq. Q) as a 
function of the first n eigenvalues; the error bars are calculated 
as standard deviations. 
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FIG. 2: relative dispersion of eigenvalues. 

ing the largest) and {fi { k j) := A^ } (3A^ - 6)/Tr[C^]}. The 
/i's are normalized so that their sum reproduces the num- 
ber of DoF. The k-th. eigenvector triples {v^,v^,v^} 
are also extracted. 

The first quantity we analyzed is the average partial 
trace, or cumulative fraction of captured motion, defined 
as: 

^ n 1 U 

L J k=l k=l 

and plotted in fig. [I] As expected, the first 3-4 eigen- 
values capture more than 70% of the protein's overall 
mobility, and this value is typically larger in MD simu- 
lations [29]. The very narrow dispersion validates in a 
statistical sense the persistence of this feature during the 
MD simulation. 

In order to statistically characterize each eigenvalue we 
plot in fig. [2] its relative dispersion (stdev/mean) vs. its 
index k: a low ratio signals a strong localization. The 
/i's display an almost constant ratio, suggesting a certain 
stability in the distribution of the fraction of total mobil- 
ity captured by each mode. In comparison, the bare A's 



3 





FIG. 3: left, level spacing distributions of the first 3 A^; right, 
samples of Xk spacings from the bulk. 
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FIG. 4: left, level spacing distributions of the first 3 right, 
samples of \ik spacings from the bulk. 



ratio rapidly decays to very low values after crossing the 
range spanned by the /i's approximately between the 3rd 
and the 4th eigenvalue, indicating a broad dispersion of 
the largest eigenvalues. 

In contrast with the different dispersion properties of 
absolute and normalized eigenvalues, the spacing distri- 
butions in the bulk show a remarkable universal pattern. 
In figs. [3]and[4j the distributions of the A's and /i's are 
fitted with a Brody distribution. With a x 2 test, the 
Brody hypothesis can be rejected with high confidence 
(1% level) for the first 3 spacings in both cases, while the 
subsequent ones give overall quite a good agreement with 
a fit parameter /3 = 0.8 ±0.1 (standard error among the 
first 100 spacings). The same j3 (within the statistical 
bounds) fits the spacing distribution for the null a-WL 
model, indicating a strong degree of randomness in the 
largest fraction of the DCM spectra. 

The analysis of level spacings is completed with a 
Kolmogorov-Smirnov (KS) test among all pairs of spac- 
ing distributions. Fig. [5] shows the color-coded values of 
the KS distances between the cumulative distributions. 
The same distribution appears to be closely shared by 
the spacings beyond the 4th, while no "partner" is found 
for the top four ones. Therefore, the sought significance 
criterion can be easily expressed as follows: there exists a 
transition between a few collective modes, characterized 



FIG. 5: KS distances among spacing distributions of the A 
(top triangle) and /x (bottom triangle). 
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FIG. 6: KS distances among distributions of the C eigenvec- 
tors. 
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FIG. 7: scalar products of optimal vectors pairs. The green 
line, indicating the 0.9 threshold, is shown as a guide to the 
eye. 



by a non-standard local level statistics, and the bulk of 
modes sharing the same quasi-universal distribution. 
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In order to further validate these indications, in fig. 
[6] we report a color-coded table of KS distances among 
eigenvector distributions. The entry (£, k) represents the 
KS distance between the and cumulative distribu- 
tions, after a proper weighting of the three Cartesian 
components. Again, their distribution stabilizes only 
from the 5 — 6th eigenvector onwards, while the first 4 
clearly display a very poor overlap with the full set. 

As a final check, we divided the MD trajectory in two 
halves and computed the DCM for each sub-trajectory. 
We then applied the method introduced in [5] to deter- 
mine an optimal redefinition of the orthonormal basis 
vectors of the two essential spaces in order to quantify 
the degree of overlap between the two sets. Specifically, 
the redefined basis vectors in one set are ranked in order 
of decreasing overlap with the linear space spanned by the 
vectors in the other set. The statistical significance of the 
first modes is confirmed by the high overlap of the first 
few optimal eigenvector pairs. This criterion, based on 
the properties of the essential space vectors rather than 
the eigenvalues, identifies about 4 relevant modes for the 
conserved subspace with a confidence level of about 90% 
(see fig. [7]). 

Summary and Outlook. In the present paper we ap- 
plied RMT tools to an ensemble of covariance matrices 
obtained in a biophysical context. Making use of an 
anisotropic ENM, applied to each configuration of a MD 
simulation, we produced a large ensemble of covariance 
matrices for Adenylate Kinase. The statistical properties 
of these DCM eigenspaces have been compared with uni- 
versal RMT predictions, such as the Brody level spacing 
distribution of a suitable random ensemble. 

The present study highlights signatures of bulk uni- 
versality and random-like behavior shared by all but the 
first 3-4 eigenmodes of the analyzed DCM ensemble. The 
consequence is a quantifiable separation between the few 
most significant modes, characterized by their own pecu- 
liar statistics, and the bulk of quasi-random ones. Likely 
implications include a more precise identification of the 
collective variables describing the large-scale, function- 
ally relevant fluctuations of biological molecules. 

The marriage between RMT techniques and models 
of protein dynamics is expected to have a broad range 
of applications. Possible directions for future investiga- 
tion include i) the temporal evolution of the significancy 
pattern during the protein's exploration of different sub- 
states; ii) the study of global spectral properties (such as 
density and higher order correlation functions) of DCM 
within a single substate and along the full trajectory; 
iii) correlation structure among functional sub-units of a 
single protein, involving only a finite fraction of highly 
connected components; and finally iv) independent val- 
idation of the method on other protein structures and 
using MD covariance matrices rather than ENM ones. 
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